Across Languages and Genres: Creating a Universal Annotation Scheme for Textual Relations
Abstract
The present paper describes an attempt to create an interoperable scheme using existing annotations of textual phenomena across languages and genres, including non-canonical ones. This kind of analysis requires annotated multilingual resources, which are costly. Therefore, we make use of annotations already available in the resources for English, German and Czech. As the annotations in these corpora are based on different conceptual and methodological backgrounds, we need an interoperable scheme that covers existing categories and at the same time allows a comparison of the resources. In this paper, we describe how this interoperable scheme was created and which problematic cases we had to consider. The resulting scheme is intended to be applied in the future to explore contrasts between the three languages under analysis, for which we expect the greatest differences in the degree of variation between non-canonical and canonical language.

1 Aims and Motivation

The aim of the present study is to create a scheme which will allow us to use existing annotations of textual phenomena, and which will be applicable to multiple languages and genres, including non-canonical ones. The annotations were created within two separate projects: German-English Contrasts in Cohesion (GECCo, Lapshinova and Kunz (2014)), whose focus was on English and German, on the one hand, and the Prague Dependency Treebank (PDT 3.0, Bejček et al. (2013)), with its analysis of Czech, on the other. The resulting scheme will serve our overarching goal to unify the two approaches in a joint analysis of contrasts between English, German and Czech on the level of discourse. We assume that the greatest differences between these languages lie in the degree of variation between non-canonical language (here we especially mean spoken language) and canonical language.
Previous findings on lexico-grammatical and cohesive phenomena have shown that there is more variation between the written and spoken dimensions in German than in English, even though the two languages are closely related, cf. Mair (2006) or Kunz et al. (forthcoming). Studies of spoken and written Czech (see, e.g., Cvrček et al. (2010)) suggest that the differences between written and spoken language are even more pronounced in Czech than in German, at least with respect to lexico-grammar; hence we expect that this also holds for the level of text/discourse. We therefore suggest that if we draw a line of differences between spoken and written English, German and Czech, we would observe a continuum in the degree of variation between these languages, as seen in Figure 1. The graph also reflects the above assumption that the differences between English and German are less pronounced than those between either of these languages and Czech. The reasons for this lie in the linguistic heritage of these languages (English and German have a common West Germanic origin while Czech is a Slavic language) and in sociolinguistic factors that influenced their evolution (for example, Czech purism at the beginning of the 20th century, described, e.g., in Havránek and Weingart (1932)). To our knowledge, there is no
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Experiments on bridging across languages and genres
In this paper, we introduce a typology of bridging relations applicable to multiple languages and genres. After discussing our annotation guidelines, we describe annotation experiments on the German part of our parallel coreference corpus and show that our interannotator agreement results are reliable, considering both antecedent selection and relation assignment. In order to validate our theor...
Annotating Attribution Relations across Languages and Genres
In Pareti (2012) I presented an approach to the annotation of attribution defining it as a relation intertwined albeit independent from other linguistic levels and phenomena. While a portion of this relation can be identified at the syntactic level (Skadhauge and Hardt, 2005) and part of it can overlap with the argument of discourse connectives (Prasad et al., 2006), attribution is best represe...
Analysis and Reference Resolution of Bridge Anaphora across Different Text Genres
We discuss bridge relations in Dutch between two textual referents across six different text genres. After briefly presenting the annotation guidelines and inter-annotation agreement results, we conduct an in-depth manual analysis of the different types of bridge relations found in our data sets. This analysis reveals that for all genres bridging references stand mostly in a class relationship, w...
Universal Dependencies for Dargwa Mehweb
The Universal Dependencies (UD) project aims to create unified annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, crosslingual learning, and parsing research from a language typology perspective. This paper provides the de...
Publication date: 2015